Abstract

A typical Web Search Engine consists of three principal parts: 
crawling engine, indexing engine, and searching engine. The present 
work aims to optimize the performance of the crawling engine. The 
crawling engine finds new Web pages and updates Web pages existing 
in the database of the Web Search Engine. The crawling engine has 
several robots collecting information from the Internet. We first calcu- 
late various performance measures of the system (e.g., probability of 
arbitrary page loss due to the buffer overflow, probability of starvation 
of the system, the average time waiting in the buffer). Intuitively, we 
would like to avoid system starvation and at the same time to minimize 
the information loss. We formulate the problem as a multi-criteria opti- 
mization problem and attributing a weight to each criterion we solve it 
in the class of threshold policies. We consider a very general Web page 
arrival process modeled by Batch Marked Markov Arrival Process and 
a very general service time modeled by Phase-type distribution. The 
model has been applied to the performance evaluation and optimiza- 
tion of the crawler designed by INRIA Maestro team in the framework 
of the RIAM INRIA-Canon research project. 



1 Introduction 

The problem of controlling the robots (crawlers) that traverse the Web and bring Web pages to the indexing engine, which updates the database of a Web Search Engine, is formulated and analyzed in [13]. There the problem is formulated as a controlled queueing system. The system has a single server with exponential service time distribution and a finite buffer of capacity $K - 1$, $K \ge 2$. There are $N$ available robots and each of these robots, when activated, brings pages to the server in a Poisson stream at a fixed rate. These $N$ stationary Poisson processes are mutually independent and independent of the service times.

The number of active robots may be modified at any arrival or departure 
epoch. When an arrival occurs, the incoming robot is de-activated at once; 
the controller may then decide to keep it idle or to activate it. When a 
departure occurs the controller may either decide to activate one additional 
robot, if one is available, or to do nothing (i.e. the number of active robots 
is left unchanged). 

In [13], the problem of finding a policy that minimizes a weighted sum of the loss rate and the starvation probability (probability of the empty system) is considered. It is solved by means of the theory of Markov Decision Problems.

The following generalizations of the model are mentioned in [13] as worth analyzing:

• More general input processes, e.g., an MMPP (Markov Modulated Poisson Process), should be considered so as to reflect more accurately the "traveling times" of robots in the network;

• Because of the obsolescence of stored documents issue, the waiting 
time should be bounded, even if the buffer size is effectively infinite; 

• Other cost functions could be investigated, for instance, cost functions 
including response times. 

In this paper, we implement all of the generalizations mentioned above and some further ones.

We assume that, under a fixed number of currently active robots, the arrival process is of the BMAP type. The BMAP is more general than the MMPP: it allows a batch of Web pages to be delivered for indexing, while the MMPP assumes that the pages are delivered one by one. Batch operation is very typical for computer systems.

We assume that the service time distribution is of the PH (phase) type, which is much more general than the exponential distribution assumed in [13]. The class of phase-type distributions is dense in the set of all distributions on the positive half-line, so in practice any real distribution can be approximated [2].

Since Web pages can become obsolete, we bound the waiting time stochastically. The waiting time of each Web page in the buffer is restricted by a random variable having a PH distribution; these random variables are identically distributed and mutually independent for all Web pages. The phase-type distribution has been used to model obsolescence times, for instance, in [15].

We suppose that the cost function can have a more general form than in [13] and can include the obsolescence probability and the response time.

In the next section we formulate the model and optimization problem. 
Section 3 contains the steady-state analysis of the multi-dimensional Markov 
chain which defines dynamics of the system under the fixed values of the 
parameters defining the strategy of control. In Section 4, main performance 
measures of the system are computed. In Section 5, the conditional sojourn 
time distributions are calculated. In Section 6, the case of ordinary arrivals
is briefly discussed. In Section 7, the theoretical results are illustrated by
numerical examples. In particular, the mathematical model is applied to the 
performance evaluation and optimization of the robot designed by INRIA 
Maestro team in the framework of the RIAM INRIA-Canon research project. 
Section 8 concludes the paper. 



2 Mathematical Model

We consider a single server system with a finite buffer of capacity $K - 1$, $K \ge 2$. So, the total number of Web pages that can stay in the system is restricted by the number $K$. Web pages are served by the server in the order of their arrival.

Service times of Web pages are independent identically distributed random variables having a PH distribution with irreducible representation $(\beta, S)$. This means the following. The service time of a Web page is defined as the time until the continuous-time Markov chain $m_t$, $t \ge 0$, having the states $(1, \dots, M)$ as transient ones and the state $0$ as the absorbing one, reaches the absorbing state. The initial state of the chain is selected randomly, according to the probability distribution defined by the row vector $(\beta, 0)$, where $\beta$ is a stochastic row vector of dimension $M$. Transitions of the Markov chain $m_t$, $t \ge 0$, are described by the generator

$$\begin{pmatrix} S & S_0 \\ 0 & 0 \end{pmatrix},$$

where the matrix $S$ is a sub-generator and the column vector $S_0$ is defined by $S_0 = -S e_M$ and has all non-negative components, at least one of which is positive; $e_M$ is the column vector of dimension $M$ consisting of all 1's. The average service time $b_1$ is given by $b_1 = \beta(-S)^{-1} e_M$. For more details about the PH type distribution, its properties, special cases and applications see [11], [12].
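As an illustration of the PH representation, the following minimal sketch evaluates the mean service time $b_1 = \beta(-S)^{-1} e_M$ numerically. The function name `ph_mean` and the Erlang-2 example with rate 3 per phase are assumptions chosen for illustration only; they do not come from the model above.

```python
import numpy as np

def ph_mean(beta, S):
    """Mean of a PH distribution with representation (beta, S): beta (-S)^{-1} e."""
    beta = np.asarray(beta, dtype=float)
    S = np.asarray(S, dtype=float)
    e = np.ones(S.shape[0])
    # Solve (-S) x = e instead of forming the inverse explicitly.
    return float(beta @ np.linalg.solve(-S, e))

# Hypothetical example: Erlang-2 service time with rate 3 in each phase.
beta = [1.0, 0.0]
S = [[-3.0, 3.0],
     [0.0, -3.0]]
S0 = -np.asarray(S) @ np.ones(2)   # exit-rate vector S_0 = -S e
b1 = ph_mean(beta, S)              # mean of an Erlang-2 with rate 3
```

The same function applies to any irreducible representation, including the obsolescence time distribution introduced later in the model.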

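Web pages can be delivered into the system by $N$ available robots. The number of active robots varies in the set $\{1, \dots, N\}$. We assume that the process of Web page delivery under $l$, $l = 1, \dots, N$, active robots is described as follows. Let $\nu_t$, $t \ge 0$, be an irreducible continuous-time Markov chain having the finite state space $\{0, 1, \dots, W\}$. The sojourn time of the chain $\nu_t$, $t \ge 0$, in the state $\nu$ has an exponential distribution with parameter $\lambda_\nu^{(l)}$. After this time expires, with probability $p_0^{(l)}(\nu, \nu')$ the chain jumps into the state $\nu'$ without generation of Web pages, and with probability $p_k^{(l)}(\nu, \nu')$ the chain jumps into the state $\nu'$ and a batch consisting of $k$ Web pages is generated, $k \ge 1$. The introduced probabilities satisfy the condition

$$\sum_{k=1}^{\infty} \sum_{\nu'=0}^{W} p_k^{(l)}(\nu, \nu') + \sum_{\nu'=0}^{W} p_0^{(l)}(\nu, \nu') = 1.$$

The parameters defining this flow are kept in the square matrices $\mathcal{D}_k^{(l)}$, $k \ge 0$, $l = 1, \dots, N$, of size $\bar W = W + 1$, defined by their entries:

$$(\mathcal{D}_0^{(l)})_{\nu,\nu} = -\lambda_\nu^{(l)}, \quad (\mathcal{D}_0^{(l)})_{\nu,\nu'} = \lambda_\nu^{(l)} p_0^{(l)}(\nu, \nu'), \ \nu' \ne \nu, \qquad (1)$$

$$(\mathcal{D}_k^{(l)})_{\nu,\nu'} = \lambda_\nu^{(l)} p_k^{(l)}(\nu, \nu'), \quad \nu, \nu' = 0, \dots, W, \ k \ge 1, \ l = 1, \dots, N.$$

Denote

$$\mathcal{D}^{(l)}(z) = \sum_{k=0}^{\infty} \mathcal{D}_k^{(l)} z^k, \quad |z| \le 1.$$

The matrix $\mathcal{D}^{(l)}(1)$ is the infinitesimal generator of the process $\nu_t$, $t \ge 0$, under the fixed number $l$ of active robots. The stationary distribution vector $\theta^{(l)}$ of this process satisfies the equations $\theta^{(l)} \mathcal{D}^{(l)}(1) = \mathbf{0}$, $\theta^{(l)} e = 1$. Here and in the sequel, $\mathbf{0}$ is the zero row vector. The average intensity $\lambda^{(l)}$ (fundamental rate) of the BMAP under the fixed number $l$ of active robots is defined by

$$\lambda^{(l)} = \theta^{(l)} \left. \frac{d \mathcal{D}^{(l)}(z)}{dz} \right|_{z=1} e,$$

and the intensity $\lambda_g^{(l)}$ of group arrivals is defined by

$$\lambda_g^{(l)} = \theta^{(l)} (-\mathcal{D}_0^{(l)}) e.$$

The variance $v^{(l)}$ of intervals between group arrivals is calculated as

$$v^{(l)} = 2 (\lambda_g^{(l)})^{-1} \theta^{(l)} (-\mathcal{D}_0^{(l)})^{-1} e - (\lambda_g^{(l)})^{-2},$$

while the correlation coefficient $c_{cor}^{(l)}$ of intervals between successive group arrivals is given by

$$c_{cor}^{(l)} = \frac{(\lambda_g^{(l)})^{-1} \theta^{(l)} (-\mathcal{D}_0^{(l)})^{-1} \left(\mathcal{D}^{(l)}(1) - \mathcal{D}_0^{(l)}\right) (-\mathcal{D}_0^{(l)})^{-1} e - (\lambda_g^{(l)})^{-2}}{v^{(l)}}.$$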
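The rates defined above can be evaluated directly once the matrices of a BMAP are given. The sketch below is an illustration under stated assumptions: the batch size is bounded (a finite list of matrices), the function name `bmap_rates` is hypothetical, and the two-state example matrices are invented for the demonstration.

```python
import numpy as np

def bmap_rates(D):
    """D = [D_0, D_1, ..., D_kmax]: square arrays of a BMAP with bounded batch size.
    Returns (theta, lam, lam_g): stationary phase vector, fundamental rate,
    and intensity of group (batch) arrivals."""
    D = [np.asarray(Dk, dtype=float) for Dk in D]
    n = D[0].shape[0]
    D1 = sum(D)                                   # generator D(1)
    # theta D(1) = 0, theta e = 1: replace one balance equation by normalisation.
    A = np.vstack([D1.T[:-1], np.ones(n)])
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    theta = np.linalg.solve(A, rhs)
    e = np.ones(n)
    lam = float(theta @ sum(k * Dk for k, Dk in enumerate(D)) @ e)  # theta D'(1) e
    lam_g = float(theta @ (-D[0]) @ e)
    return theta, lam, lam_g

# Hypothetical two-state BMAP with batches of size 1 and 2.
D0 = [[-3.0, 1.0], [2.0, -5.0]]
D1 = [[1.0, 0.0], [0.0, 1.0]]
D2 = [[1.0, 0.0], [1.0, 1.0]]
theta, lam, lam_g = bmap_rates([D0, D1, D2])
```

For this example the fundamental rate exceeds the group arrival rate, as expected whenever batches larger than one occur.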

The introduced representation of the arrival process via the matrices $\mathcal{D}_k^{(l)}$, $k \ge 0$, $l = 1, \dots, N$, unifies several possible interpretations of the process of Web pages delivered by a fixed number of active robots:

1. The processes of Web page delivery by the individual robots are independent BMAPs. Let the process of Web page delivery by the $l$th robot be the BMAP governed by the continuous-time Markov chain $\nu_t^{(l)}$, $t \ge 0$, having the finite state space $\{0, 1, \dots, W_l\}$ and defined by the matrices $D_k^{(l)}$, $k \ge 0$, of size $\bar W_l = W_l + 1$. See [10] for more details about the BMAP, its properties and special cases. We denote

$$D^{(l)}(z) = \sum_{k=0}^{\infty} D_k^{(l)} z^k, \quad |z| \le 1.$$

Let us assume that the robots are arranged in such a way that the first robot is always active; then, when the queue decreases, the second robot can be activated, and so on; the $N$th robot is the most rarely activated.

The matrices $\mathcal{D}_k^{(l)}$, $k \ge 0$, $l = 1, \dots, N$, of size $\bar W = \prod_{k=1}^{N} \bar W_k$, defined by formulae (1), are expressed via the matrices $D_k^{(l)}$, $k \ge 0$, of size $\bar W_l$, $l = 1, \dots, N$, describing the individual BMAPs in the following way:

$$\mathcal{D}_0^{(l)} = D_0^{(1)} \oplus \cdots \oplus D_0^{(l)} \oplus D^{(l+1)}(1) \oplus \cdots \oplus D^{(N)}(1),$$

$$\mathcal{D}_k^{(l)} = \left(D_k^{(1)} \oplus D_k^{(2)} \oplus \cdots \oplus D_k^{(l)}\right) \otimes I_{\bar W_{l+1}} \otimes \cdots \otimes I_{\bar W_N}, \quad k \ge 1.$$

Here $\otimes$ and $\oplus$ denote the Kronecker product and the Kronecker sum of matrices, respectively (see, e.g., [7]), and $I_L$ denotes the identity matrix of size $L$. If the size of the matrix is clear from the context, the suffix can be omitted.


2. The common process of Web pages delivered by all robots together is the BMAP directed by the continuous-time Markov chain $\nu_t$, $t \ge 0$, having the finite state space $\{0, 1, \dots, W\}$ and defined by the matrices $D_k$, $k \ge 0$, of size $\bar W$. Some set of thinning probabilities $q_1, \dots, q_N$, $0 < q_1 < \cdots < q_N = 1$, is fixed. When $l$ robots are active, the procedure of thinning the BMAP with the thinning probability $q_l$ is applied. This means that an arbitrary arriving batch is accepted with probability $q_l$ and is rejected with the complementary probability $1 - q_l$. We denote $D(z) = \sum_{k=0}^{\infty} D_k z^k$, $|z| \le 1$.

The matrices $\mathcal{D}_k^{(l)}$, $k \ge 0$, $l = 1, \dots, N$, defined by formulae (1), are expressed via the matrices $D_k$, $k \ge 0$, describing the common BMAP and via the thinning probability $q_l$ in the following way:

$$\mathcal{D}_0^{(l)} = D_0 q_l + D(1)(1 - q_l), \quad \mathcal{D}_k^{(l)} = D_k q_l, \quad k \ge 1.$$

3. Let the process of Web page delivery by the robots be described by a BMMAP. The BMMAP is directed by the continuous-time Markov chain $\nu_t$, $t \ge 0$, having the finite state space $\{0, 1, \dots, W\}$. The sojourn time of the chain $\nu_t$, $t \ge 0$, in the state $\nu$ has an exponential distribution with parameter $\lambda_\nu$. After this time expires, with probability $p_0(\nu, \nu')$ the chain jumps into the state $\nu'$ without generation of Web pages, and with probability $p_k^{(l)}(\nu, \nu')$ the chain jumps into the state $\nu'$ and a batch consisting of $k$, $k \ge 1$, Web pages is delivered by the $l$th robot. The introduced probabilities satisfy the condition

$$\sum_{l=1}^{N} \sum_{k=1}^{\infty} \sum_{\nu'=0}^{W} p_k^{(l)}(\nu, \nu') + \sum_{\nu'=0}^{W} p_0(\nu, \nu') = 1.$$

The matrices $\mathcal{D}_k^{(l)}$, $k \ge 0$, $l = 1, \dots, N$, defined by formulae (1), are expressed via the matrices $D_0$, $D_k^{(l)}$, $k \ge 1$, $l = 1, \dots, N$, formed by the probabilities $p_0(\nu, \nu')$ and $p_k^{(l)}(\nu, \nu')$, in the following way:

$$\mathcal{D}_0^{(l)} = D_0 + \sum_{j=1}^{\infty} \sum_{m=l+1}^{N} D_j^{(m)}, \quad \mathcal{D}_k^{(l)} = \sum_{m=1}^{l} D_k^{(m)}, \quad k \ge 1.$$

4. The process of Web page delivery by all robots is the BMAP directed by the continuous-time Markov chain $\nu_t$, $t \ge 0$, having the finite state space $\{0, 1, \dots, W\}$ and defined by the transition intensity matrices $\mathcal{D}_k^{(l)}$, $k \ge 0$, $l = 1, \dots, N$, depending on the number of active robots.
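To make the second interpretation concrete, here is a minimal sketch of thinning a BMAP with acceptance probability $q_l$. It assumes a bounded batch size; the function name `thinned_bmap` and the example matrices are hypothetical. Rejected batches are turned into transitions of the directing chain that generate no pages, so the generator $D(1)$ of the directing chain is unchanged by thinning.

```python
import numpy as np

def thinned_bmap(D, q):
    """Thin a BMAP [D_0, D_1, ...]: each arriving batch is accepted with probability q."""
    D = [np.asarray(Dk, dtype=float) for Dk in D]
    D_full = sum(D)                            # D(1), generator of the directing chain
    D0_thin = q * D[0] + (1.0 - q) * D_full    # rejected batches become silent transitions
    return [D0_thin] + [q * Dk for Dk in D[1:]]

# Hypothetical common BMAP with batches of size 1 and 2.
D = [np.array([[-3.0, 1.0], [2.0, -5.0]]),
     np.array([[1.0, 0.0], [0.0, 1.0]]),
     np.array([[1.0, 0.0], [1.0, 1.0]])]
thin = thinned_bmap(D, 0.4)
```

With $q = 1$ the original matrices are recovered, which corresponds to all $N$ robots being active.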

Interpretation 3 seems to be the most attractive because it allows the work of the robots to be dependent, which is quite realistic. Because the total number of Web servers from which new pages should be brought is more or less constant, a reduction of the number of active robots increases the part of the Internet crawled by each robot and changes the travel time correspondingly. So, the BMMAP looks to be the most realistic model of Web page delivery.

If a batch of delivered Web pages finds the server free, one Web page starts service immediately while the rest move to the buffer. If the server is busy at an arrival epoch, all Web pages of the batch are placed into the buffer if there is enough free space in the buffer. If the number of free places in the buffer is less than the number of Web pages in the batch, the corresponding number of Web pages is lost. This means that we consider the so-called partial admission strategy. The alternative strategies of complete rejection or complete admission can be investigated in an analogous way.

For each Web page placed into the buffer, the waiting time is restricted by a random variable (the so-called obsolescence time) having a PH distribution with irreducible representation $(\gamma, \Gamma)$. This means the following. The available waiting time of the $i$th Web page in the buffer is defined as the time until the continuous-time Markov chain $r_t^{(i)}$, $t \ge 0$, having the states $(1, \dots, R)$ as transient ones and the state $0$ as the absorbing one, reaches the absorbing state. Transition of this process into the absorbing state means that this Web page gets out of date (obsolescence or dashout occurs). The initial state of the chain is selected randomly, according to the probability distribution defined by the row vector $(\gamma, 0)$, where $\gamma$ is a stochastic row vector of dimension $R$. Transitions of the Markov chain $r_t^{(i)}$, $t \ge 0$, are described by the generator

$$\begin{pmatrix} \Gamma & \Gamma_0 \\ 0 & 0 \end{pmatrix},$$

where the matrix $\Gamma$ is a sub-generator and the column vector $\Gamma_0$ is defined by $\Gamma_0 = -\Gamma e$. The average time until obsolescence $g_1$ is given by $g_1 = \gamma(-\Gamma)^{-1} e$.

If the obsolescence time expires before a Web page is picked up from the buffer by the server, it is assumed that this Web page immediately leaves the buffer and is lost. The obsolescence times of different Web pages are independent of each other and identically distributed. It is worth noting that the analysis presented below could be drastically simplified if we assumed that the obsolescence time is exponentially distributed. However, this assumption rarely holds in real-world systems, because it means that, with high probability, information becomes obsolete very quickly.

A reasonable class of strategies for controlling the robots is the class of threshold strategies, defined as follows. Integers $j_1, \dots, j_{N-1}$ are fixed such that $-1 = j_0 < j_1 < \cdots < j_{N-1} < j_N = K$. If the number $i$ of Web pages in the system satisfies the inequality $j_{r-1} + 1 \le i \le j_r$, then $N - r + 1$ robots are active and $r - 1$ robots are de-activated, $r = 1, \dots, N$.
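The threshold rule can be read as a simple lookup from the number of pages in the system to the number of active robots. The following sketch illustrates this mapping; the function name `active_robots` and the example values $N = 3$, $K = 6$, thresholds $(1, 3)$ are assumptions made for the illustration.

```python
def active_robots(i, thresholds, N, K):
    """Number of active robots when i pages are in the system.

    thresholds = (j_1, ..., j_{N-1}); j_0 = -1 and j_N = K are implied.
    """
    if not 0 <= i <= K:
        raise ValueError("i must lie in 0..K")
    j = [-1] + list(thresholds) + [K]
    for r in range(1, N + 1):
        if j[r - 1] + 1 <= i <= j[r]:
            return N - r + 1          # r - 1 robots are de-activated
    raise ValueError("thresholds do not cover i")

# Hypothetical example: N = 3 robots, K = 6, thresholds (j_1, j_2) = (1, 3).
profile = [active_robots(i, (1, 3), N=3, K=6) for i in range(7)]
```

In this example all three robots work while the system holds at most one page, and only one robot remains active once more than three pages are present.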

Note that the described threshold strategies are popular in literature in 
controlled queues, see, e.g., [TJ [3j El [H] . For some systems, it is proven 
that the optimal strategy in the class of all Markovian strategies belongs to 
the class of threshold strategies. For some other systems such a result is not 
proven, but the optimal strategy is sought in the class of threshold strategies. 
The advantage of such strategies is their intuitive justification and relative simplicity of implementation in real-life systems. Numerical examples presented in [13] for a special case of our model confirm that threshold strategies are optimal in the class of all Markovian strategies, although the authors cannot prove this fact. Our system is much more complicated, and we also cannot prove optimality of the optimal threshold strategy in wider classes of strategies. We just try to find an optimal threshold strategy and believe that it is optimal or sub-optimal in wider classes as well.

We also mention that the description of the threshold strategy given above suits only the case $N < K$, while the numerical examples presented in [13] address, e.g., the case $N = 16$ and $K = 5$. However, if we look at the optimal strategy given by Figure 2 in [13], we see that the optimal number of active robots varies between 4 and 14 with 4 switching points where two or three robots are activated or de-activated. Because, in contrast to [13], we do not assume that each active robot generates a stationary Poisson process of arrivals at a fixed rate but assume that each robot generates a batch Markovian arrival process, our strategy suits the case $N > K$ as well. We could achieve this formally by allowing non-strict inequalities $-1 = j_0 \le j_1 \le \cdots \le j_{N-1} \le j_N = K$ in fixing the thresholds. So, speaking below about a robot, we may think of a virtual robot as a group of several available real robots that are activated and de-activated simultaneously.

We will solve the problem of choosing the optimal threshold strategy. The cost function is assumed to be of the following form:

$$J = \lambda (c_{loss} P_{loss} + c_{obs} P_{obs}) + a \bar V^{(1)} + c_{rob} N_{act} + c_{star} P_{star}, \qquad (2)$$

where $\lambda$ is the mean number of pages delivered into the system by robots during a unit of time (fundamental rate of the arrival process), $P_{loss}$ is the probability of an arbitrary page loss due to buffer overflow, $P_{obs}$ is the probability of an arbitrary page obsolescence during waiting in the queue, $P_{star}$ is the probability of starvation of the system, $\bar V^{(1)}$ is the response time (average sojourn time of Web pages which are not lost or deleted due to obsolescence), $N_{act}$ is the average number of active robots, and $c_{loss}$, $c_{obs}$, $a$, $c_{rob}$, $c_{star}$ are the corresponding non-negative cost coefficients. The values of the cost coefficients can be set by experts in the domain. Alternatively, the cost coefficients can be viewed as Lagrange multipliers in the constrained problem and can be found from the dual problem formulation.

It is clear that if the average number of active robots is increasing then 
the three first summands in (2) are increasing while the last summand, 
charge for starvation of the system, is decreasing. The last charge is im- 
portant because starvation means that the indexing machine is idle and so 
freshness of the data base suffers. 

The problem of minimization of the cost criterion (2) is not trivial. To solve this problem, we will use the so-called direct approach. To this end, we will calculate the stationary distribution of the system state under an arbitrary fixed set of thresholds $(j_1, \dots, j_{N-1})$. It will allow us to calculate the main performance measures of the system and the value $J$ of the cost criterion as a function $J(j_1, \dots, j_{N-1})$. The problem of finding the optimal set $(j_1^*, \dots, j_{N-1}^*)$ is then easily solved on a computer, e.g., by enumeration.
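The enumeration step can be sketched as follows. This is only an outline under stated assumptions: the function name `best_thresholds` is hypothetical, and the `cost` argument stands for a user-supplied routine that would compute the stationary distribution for a given threshold set and evaluate criterion (2). The toy cost used in the example is invented and has no connection to the model.

```python
from itertools import combinations

def best_thresholds(N, K, cost):
    """Enumerate all integer sets -1 < j_1 < ... < j_{N-1} < K and return
    the set that minimises the user-supplied function cost(thresholds)."""
    candidates = combinations(range(0, K), N - 1)   # strictly increasing tuples
    return min(candidates, key=cost)

# Toy stand-in for J(j_1, ..., j_{N-1}); a real cost would evaluate criterion (2).
def toy_cost(js):
    return sum((j - target) ** 2 for j, target in zip(js, (1, 4)))

optimum = best_thresholds(N=3, K=6, cost=toy_cost)
n_candidates = sum(1 for _ in combinations(range(0, 6), 2))
```

The number of candidate sets is the binomial coefficient of $K$ over $N - 1$, so plain enumeration stays cheap for moderate $K$ and $N$; the cost of each evaluation is dominated by the computation of the stationary distribution.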

3 Stationary distribution of the number of Web 
pages in the system 

Let some set of thresholds $(j_1, \dots, j_{N-1})$ be fixed. We are interested in the stationary distribution of the process $i_t$, $t \ge 0$, where $i_t$ is the number of Web pages in the system at the epoch $t$, $i_t = 0, \dots, K$. This process is non-Markovian. To investigate this process, we will consider the following multi-dimensional continuous-time Markov chain

$$\xi_t = \{i_t, \nu_t, m_t, r_t^{(1)}, \dots, r_t^{(i_t - 1)}\}, \quad t \ge 0,$$

where $\nu_t$ is the state of the directing process of the arrival process at epoch $t$, $\nu_t = 0, \dots, W$; $m_t$ is the state of the process which directs the service at epoch $t$, $m_t = 1, \dots, M$; $r_t^{(i)}$ is the state of the process which directs the obsolescence of the $i$th Web page in the queue at epoch $t$, $r_t^{(i)} = 1, \dots, R$, $i = 1, \dots, K - 1$.

We assume that the Web pages in the buffer are numbered in the order of their arrival into the system. If a batch of Web pages arrives to the system, the accepted Web pages are numbered in a uniform random manner. When a Web page is picked up for service or is deleted from the queue because its admissible waiting time expires, the remaining Web pages are immediately renumbered accordingly.
Denote

$$p(0, \nu) = \lim_{t \to \infty} P\{i_t = 0, \nu_t = \nu\},$$

$$p(1, \nu, m) = \lim_{t \to \infty} P\{i_t = 1, \nu_t = \nu, m_t = m\}, \qquad (3)$$

$$p(i, \nu, m, r_1, \dots, r_{i-1}) = \lim_{t \to \infty} P\{i_t = i, \nu_t = \nu, m_t = m, r_t^{(1)} = r_1, \dots, r_t^{(i-1)} = r_{i-1}\}, \quad i \ge 2.$$

Because the state space of the Markov chain $\xi_t$, $t \ge 0$, is finite and due to the assumption of irreducibility of the processes defining arrival, service and obsolescence, the limits (3) exist.

Enumerate the states of the Markov chain $\xi_t$, $t \ge 0$, in lexicographic order and form the row vectors $p_i$, $i = 0, \dots, K$, of probabilities corresponding to the value $i$ of the first component of the process $\xi_t$, $t \ge 0$. Denote also $p = (p_0, \dots, p_K)$.

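Let $Q$ be the infinitesimal generator of the Markov chain $\xi_t$, $t \ge 0$, and $Q_{i,i'}$ be the block of the generator $Q$ consisting of the intensities, enumerated in lexicographic order, of transition of the Markov chain $\xi_t$, $t \ge 0$, from the states with the value $i$ of the component $i_t$ to the states with the value $i'$ of this component, $i, i' \ge 0$. The dimension of the block $Q_{i,i'}$ is $(\bar W M^{a_i} R^{b_i}) \times (\bar W M^{a_{i'}} R^{b_{i'}})$, where $a_l = \min\{l, 1\}$, $b_l = \max\{l - 1, 0\}$, $l = 0, \dots, K$.

Lemma. The non-zero blocks $Q_{i,i'}$ of the infinitesimal generator $Q$ of the Markov chain $\xi_t$, $t \ge 0$, are defined by

$$Q_{0,j} = \begin{cases} \mathcal{D}_j^{(\chi(0))} \otimes \beta \otimes \gamma^{\otimes (j-1)}, & j = 1, \dots, K-1, \\ \left(\mathcal{D}^{(\chi(0))}(1) - \sum_{r=0}^{K-1} \mathcal{D}_r^{(\chi(0))}\right) \otimes \beta \otimes \gamma^{\otimes (K-1)}, & j = K, \end{cases}$$

$$Q_{i,i-1} = \begin{cases} I_{\bar W} \otimes S_0, & i = 1, \\ I_{\bar W} \otimes B_{i-1}, & i > 1, \end{cases}$$

$$Q_{i,i} = \begin{cases} \mathcal{D}_0^{(\chi(0))}, & i = 0, \\ \mathcal{D}_0^{(\chi(i))} \oplus A_{i-1}, & i = 1, \dots, K-1, \\ \mathcal{D}^{(\chi(K))}(1) \oplus A_{K-1}, & i = K, \end{cases}$$

$$Q_{i,i+l} = \begin{cases} \mathcal{D}_l^{(\chi(i))} \otimes I_{M R^{i-1}} \otimes \gamma^{\otimes l}, & l = 1, \dots, K-i-1, \\ \left(\mathcal{D}^{(\chi(i))}(1) - \sum_{r=0}^{K-i-1} \mathcal{D}_r^{(\chi(i))}\right) \otimes I_{M R^{i-1}} \otimes \gamma^{\otimes l}, & l = K - i, \end{cases} \quad 0 < i < K.$$

Here

$$B_i = S_0 \beta \otimes e_R \otimes I_{R^{i-1}} + I_M \otimes \Gamma_0^{\oplus i}, \quad A_i = S \oplus \Gamma^{\oplus i}, \quad i = 0, \dots, K-1,$$

$$\gamma^{\otimes l} = \underbrace{\gamma \otimes \cdots \otimes \gamma}_{l}, \quad \Gamma^{\oplus l} = \underbrace{\Gamma \oplus \cdots \oplus \Gamma}_{l}, \quad \Gamma_0^{\oplus l} = \underbrace{\Gamma_0 \oplus \cdots \oplus \Gamma_0}_{l}, \quad l \ge 1,$$

and the value $\chi(i)$, which corresponds to the number of active robots when $i$ Web pages stay in the system, is calculated by $\chi(i) = N - \psi(i)$, where the value $\psi(i)$, $\psi(i) = 0, \dots, N-1$, is defined by the relations

$$j_{\psi(i)} + 1 \le i \le j_{\psi(i)+1}, \quad i = 0, \dots, K.$$

The lemma is proved by analyzing the transition probabilities of the multi-dimensional Markov chain $\xi_t$, $t \ge 0$, under the fixed set of thresholds $(j_1, \dots, j_{N-1})$ during an interval of infinitesimal length. The proof is rather transparent and is omitted.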

It is well-known that the vector $p$ of stationary probabilities satisfies the following equilibrium equations

$$p Q = \mathbf{0}, \quad p e = 1. \qquad (4)$$

Dimension of the vector p is equal to W(l + M R „_ i -1 )- It can be rather 
high. For instance, if $\bar W = M = R = 2$ this dimension is equal to $2^{K+1} - 2$.
For $K = 10$ this equals 2046. So, direct solution of equations (4) by
"brute force" can be very demanding in time and computer memory.

An effective and numerically stable algorithm for computing the blocks $p_i$, $i = 0, \dots, K$, of the vector $p$, which exploits the structure of the generator $Q$ and is presented in [3], consists of the following steps:

1. Compute the matrices $G_i$, $i = 0, \dots, K-1$, from the backward recursion

$$G_i = \left(-Q_{i+1,i+1} - \sum_{l=1}^{K-i-1} Q_{i+1,i+1+l} G_{i+l} G_{i+l-1} \cdots G_{i+1}\right)^{-1} Q_{i+1,i}, \quad i = K-1, K-2, \dots, 0.$$

2. Compute the matrices $\bar Q_{i,l}$ from the backward recursions

$$\bar Q_{i,K} = Q_{i,K}, \quad i = 0, \dots, K,$$

$$\bar Q_{i,l} = Q_{i,l} + \bar Q_{i,l+1} G_l, \quad i = 0, \dots, l, \quad l = K-1, K-2, \dots, 0.$$

3. Compute the matrices $F_l$, $l = 0, \dots, K$, by the recursion

$$F_0 = I, \quad F_l = \sum_{i=0}^{l-1} F_i \bar Q_{i,l} (-\bar Q_{l,l})^{-1}, \quad l = 1, \dots, K.$$

4. Compute the vector $p_0$ as the unique solution to the following system of linear algebraic equations:

$$p_0 \bar Q_{0,0} = \mathbf{0}, \quad p_0 \sum_{l=0}^{K} F_l e = 1.$$

5. Compute the vectors $p_i$, $i = 1, \dots, K$, by the formula

$$p_i = p_0 F_i, \quad i = 1, \dots, K.$$

Thus, the problem of computing the stationary distribution $p_i$, $i = 0, \dots, K$, of the considered queueing system under an arbitrary fixed set of thresholds $(j_1, \dots, j_{N-1})$ can be considered solved.
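The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation of [3]: the function name `stationary_blocks`, the nested-list layout of the blocks, and the small scalar-block example chain are all assumptions. Linear systems are solved instead of forming inverses explicitly.

```python
import numpy as np

def stationary_blocks(Q):
    """Stationary vectors p_0..p_K of a block upper-Hessenberg generator.

    Q[i][j] is the block Q_{i,j} as a 2-D array; it must be supplied
    (possibly as zeros) for every j >= i - 1, with levels i = 0..K.
    """
    K = len(Q) - 1
    # Step 1: G_i for i = K-1, ..., 0.
    G = [None] * K
    for i in range(K - 1, -1, -1):
        M = -Q[i + 1][i + 1]
        for l in range(1, K - i):
            term = Q[i + 1][i + 1 + l]
            for m in range(i + l, i, -1):        # G_{i+l} G_{i+l-1} ... G_{i+1}
                term = term @ G[m]
            M = M - term
        G[i] = np.linalg.solve(M, Q[i + 1][i])
    # Step 2: Qbar_{i,K} = Q_{i,K}; Qbar_{i,l} = Q_{i,l} + Qbar_{i,l+1} G_l.
    Qb = [[None] * (K + 1) for _ in range(K + 1)]
    for i in range(K + 1):
        Qb[i][K] = Q[i][K]
    for l in range(K - 1, -1, -1):
        for i in range(l + 1):
            Qb[i][l] = Q[i][l] + Qb[i][l + 1] @ G[l]
    # Step 3: F_0 = I, F_l = sum_{i<l} F_i Qbar_{i,l} (-Qbar_{l,l})^{-1}.
    F = [np.eye(Q[0][0].shape[0])]
    for l in range(1, K + 1):
        R = sum(F[i] @ Qb[i][l] for i in range(l))
        F.append(np.linalg.solve(-Qb[l][l].T, R.T).T)
    # Step 4: p_0 Qbar_{0,0} = 0 with normalisation p_0 sum_l F_l e = 1.
    c = sum(Fl.sum(axis=1) for Fl in F)
    A = np.vstack([Qb[0][0].T[:-1], c])          # drop one balance equation
    rhs = np.zeros(len(c))
    rhs[-1] = 1.0
    p0 = np.linalg.solve(A, rhs)
    # Step 5: p_i = p_0 F_i.
    return [p0 @ Fl for Fl in F]

# Hypothetical 4-level chain with scalar blocks and batch arrivals of size 1 or 2.
dense = np.array([[-3.0, 2.0, 1.0, 0.0],
                  [2.0, -5.0, 2.0, 1.0],
                  [0.0, 2.0, -5.0, 3.0],
                  [0.0, 0.0, 2.0, -2.0]])
Q = [[dense[i:i + 1, j:j + 1] for j in range(4)] for i in range(4)]
p = np.concatenate(stationary_blocks(Q))
```

In the example the blocks are 1x1, so the result can be compared with a direct solution of (4); for the model of this paper the blocks would be built from the Lemma and grow with the level number.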

4 Main Performance Measures of the System 

As the main performance measures of the system we consider the values $\lambda$, $P_{loss}$, $P_{obs}$, $P_{star}$, $N_{act}$ which appear in the cost criterion (2).

Calculation of the values $P_{star}$, $N_{act}$ is the easiest.

Theorem 1. The probability $P_{star}$ of the system starvation (idle state of the indexing machine) is calculated by

$$P_{star} = p_0 e_{\bar W}. \qquad (5)$$

The average number $N_{act}$ of active robots is calculated by

$$N_{act} = \sum_{n=1}^{N} n \varphi_n, \qquad (6)$$

where the probability $\varphi_n$ that $n$ robots are active at an arbitrary epoch is computed by

$$\varphi_n = \sum_{i=j_{N-n}+1}^{j_{N-n+1}} p_i e_{\bar W M^{a_i} R^{b_i}}, \quad n = 1, \dots, N. \qquad (7)$$
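Formulae (5)-(7) only need the total probability mass of each level. A minimal sketch, assuming that the level probabilities $p_i e$ have already been computed and are passed as a plain list (the function name `starvation_and_active` and the numbers in the example are hypothetical):

```python
def starvation_and_active(level_prob, thresholds, N):
    """level_prob[i] = p_i e, the stationary probability of i pages in the system
    (i = 0..K); thresholds = (j_1, ..., j_{N-1}) with j_0 = -1 and j_N = K implied.
    Returns (P_star, N_act, phi) where phi[n] is the probability of n active robots."""
    K = len(level_prob) - 1
    j = [-1] + list(thresholds) + [K]
    p_star = level_prob[0]
    # n robots are active for j_{N-n} + 1 <= i <= j_{N-n+1}, see (7).
    phi = {n: sum(level_prob[j[N - n] + 1: j[N - n + 1] + 1])
           for n in range(1, N + 1)}
    n_act = sum(n * phi[n] for n in phi)
    return p_star, n_act, phi

# Hypothetical level probabilities for K = 3 and N = 3 robots, thresholds (0, 1).
p_star, n_act, phi = starvation_and_active([0.1, 0.2, 0.3, 0.4], (0, 1), N=3)
```

Here three robots are active only in the empty system, two when one page is present, and one otherwise.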
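Theorem 2. The probability $P_{loss}$ of an arbitrary Web page loss due to buffer overflow is calculated by

$$P_{loss} = 1 - \frac{1}{\lambda} \sum_{n=0}^{N-1} \sum_{i=j_n+1}^{j_{n+1}} p_i \sum_{k=0}^{K-i} (k - K + i) \left(\mathcal{D}_k^{(N-n)} \otimes I_{M^{a_i} R^{b_i}}\right) e, \qquad (8)$$

where the average intensity $\lambda$ of the input flow is computed by

$$\lambda = \sum_{n=0}^{N-1} \sum_{i=j_n+1}^{j_{n+1}} p_i \sum_{k=1}^{\infty} k \left(\mathcal{D}_k^{(N-n)} \otimes I_{M^{a_i} R^{b_i}}\right) e. \qquad (9)$$

Proof. It follows from the formula of total probability that

$$P_{loss} = 1 - \sum_{i=0}^{K} \sum_{k=1}^{\infty} P_i^{(k)} P_k R^{(i,k)}, \qquad (10)$$

where $P_i^{(k)}$ is the probability that there are $i$ Web pages in the system at an epoch of arrival of a batch consisting of $k$ Web pages, $P_k$ is the probability that an arbitrary Web page arrives in a batch consisting of $k$ Web pages, and $R^{(i,k)}$ is the probability that an arbitrary Web page will be accepted into the system given that it arrives in a batch consisting of $k$ Web pages and $i$ Web pages are present in the system at the epoch of arrival.

It can be shown that the listed probabilities are calculated by

$$P_i^{(k)} = \frac{p_i \left(\mathcal{D}_k^{(N-n)} \otimes I_{M^{a_i} R^{b_i}}\right) e}{\sum_{m=0}^{N-1} \sum_{r=j_m+1}^{j_{m+1}} p_r \left(\mathcal{D}_k^{(N-m)} \otimes I_{M^{a_r} R^{b_r}}\right) e}, \quad j_n + 1 \le i \le j_{n+1}, \ n = 0, \dots, N-1, \qquad (11)$$

$$P_k = \frac{k \sum_{n=0}^{N-1} \sum_{i=j_n+1}^{j_{n+1}} p_i \left(\mathcal{D}_k^{(N-n)} \otimes I_{M^{a_i} R^{b_i}}\right) e}{\sum_{n=0}^{N-1} \sum_{i=j_n+1}^{j_{n+1}} p_i \sum_{l=1}^{\infty} l \left(\mathcal{D}_l^{(N-n)} \otimes I_{M^{a_i} R^{b_i}}\right) e}, \qquad (12)$$

$$R^{(i,k)} = \begin{cases} 1, & k \le K - i, \\ \dfrac{K - i}{k}, & k > K - i, \end{cases} \quad i = 0, \dots, K, \ k \ge 1. \qquad (13)$$

By substituting expressions (11)-(13) into formula (10) and using the equality $\mathcal{D}^{(l)}(1) e = 0$, we get formulae (8), (9). The theorem is proven.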
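The structure of the loss formula is easiest to see in the scalar special case, where the arrival, service and obsolescence processes each have a single phase, so that all matrices reduce to numbers. The sketch below is written for that simplified case only; the function name `loss_probability`, the data layout, and the numbers are assumptions made for illustration. Under partial admission, a batch of size $k$ arriving when $i$ pages are present contributes $\min\{k, K - i\}$ accepted pages.

```python
def loss_probability(level_prob, batch_rates):
    """Scalar-phase illustration of the loss probability and the input rate.

    level_prob[i]: stationary probability of i pages in the system, i = 0..K.
    batch_rates[i][k-1]: rate of batches of size k when i pages are present
    (this already reflects the number of robots active at level i).
    Returns (P_loss, lambda).
    """
    K = len(level_prob) - 1
    offered = 0.0
    accepted = 0.0
    for i, p in enumerate(level_prob):
        for k, rate in enumerate(batch_rates[i], start=1):
            offered += p * k * rate                  # contribution to lambda
            accepted += p * min(k, K - i) * rate     # partial admission
    return 1.0 - accepted / offered, offered

# Hypothetical example: K = 2, batches of size 1 and 2, each at rate 1 at every level.
p_loss, lam = loss_probability([0.5, 0.3, 0.2], [[1.0, 1.0]] * 3)
```

In the general model the scalar rates are replaced by the matrices of the BMAP under the number of robots active at level $i$, and the level probabilities by the vectors $p_i$.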

Theorem 3. The probability $P_{obs}$ of an arbitrary Web page obsolescence is calculated by

$$P_{obs} = \frac{1}{\lambda} \sum_{i=2}^{K} p_i \left(I_{\bar W} \otimes I_M \otimes \Gamma_0^{\oplus (i-1)}\right) e. \qquad (14)$$

The probability $P_{success}$ of an arbitrary Web page successful service in the system is calculated by

$$P_{success} = \frac{1}{\lambda} \sum_{i=1}^{K} p_i \left(I_{\bar W} \otimes S_0 \otimes I_{R^{i-1}}\right) e. \qquad (15)$$

The statement of the theorem is clear because the right-hand side of (14) represents the ratio of the obsolescence rate to the arrival rate into the system. The right-hand side of (15) represents the ratio of the rate of Web pages successfully served in the system to the arrival rate into the system.



5 Sojourn Time Distribution 

Let $V(x)$, $V_1(x)$, and $V_2(x)$ be the distribution functions of the sojourn time of an arbitrary Web page in the system under study, of an arbitrary Web page which will get successful service, and of an arbitrary Web page which will be deleted from the system due to its obsolescence, respectively, and let $v(u)$, $v^{(1)}(u)$ and $v^{(2)}(u)$ be the corresponding Laplace-Stieltjes transforms:

$$v(u) = \int_0^{\infty} e^{-ux} \, dV(x), \quad v^{(k)}(u) = \int_0^{\infty} e^{-ux} \, dV_k(x), \quad k = 1, 2, \quad \mathrm{Re}\, u > 0.$$



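Theorem 4. The Laplace-Stieltjes transforms $v(u)$, $v^{(1)}(u)$ and $v^{(2)}(u)$ are calculated by

$$v(u) = v^{(1)}(u) P_{success} + v^{(2)}(u) P_{obs} + P_{loss}, \qquad (16)$$

$$v^{(1)}(u) = \frac{1}{P_{success}} \sum_{i=0}^{K-1} \sum_{k=1}^{\infty} p_{i,k} v_{i,k}^{(1)}(u), \qquad (17)$$

$$v^{(2)}(u) = \frac{1}{P_{obs}} \sum_{i=0}^{K-1} \sum_{k=1}^{\infty} p_{i,k} v_{i,k}^{(2)}(u), \qquad (18)$$

where

$$p_{i,k} = \frac{1}{\lambda} p_i k \left(\mathcal{D}_k^{(\chi(i))} e_{\bar W} \otimes I_{M^{a_i} R^{b_i}}\right),$$

the column vectors $v_{i,k}^{(m)}(u)$, $i = 0, \dots, K-1$, $k \ge 1$, $m = 1, 2$, are computed by

$$v_{i,k}^{(m)}(u) = \sum_{l=1}^{\min\{k, K-i\}} \frac{1}{k} C_l^{(i)} v_{i+l-1}^{(m)}(u), \quad m = 1, 2, \qquad (19)$$

$$C_l^{(i)} = \beta^{\delta_{i,0}} \otimes I_{M^{a_i} R^{b_i}} \otimes \gamma^{\otimes (l - \delta_{i,0})},$$

the vectors $v_i^{(m)}(u)$, $i = 0, \dots, K-1$, $m = 1, 2$, are computed recursively by

$$v_0^{(1)}(u) = (uI - S)^{-1} S_0, \qquad (20)$$

$$v_i^{(1)}(u) = (uI - A_i)^{-1} \tilde B_{i-1} v_{i-1}^{(1)}(u), \quad i = 1, \dots, K-1,$$

$$v_0^{(2)}(u) = 0, \quad v_i^{(2)}(u) = \sum_{l=0}^{i-1} \left[\prod_{k=0}^{l-1} (uI - A_{i-k})^{-1} \tilde B_{i-k-1}\right] (uI - A_{i-l})^{-1} \left(I_{M R^{i-l-1}} \otimes \Gamma_0\right) e, \quad i = 1, \dots, K-1, \qquad (21)$$

where

$$\tilde B_i = B_i \otimes I_R, \quad i = 0, \dots, K-1,$$

and $\delta_{i,j}$ is the Kronecker delta: $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ if $i \ne j$.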

Proof. To prove the theorem, we use the method of collective marks (method of catastrophes). We interpret the parameter $u$ as the intensity of some imaginary stationary Poisson flow of catastrophes. So, $v(u)$ has the meaning of the probability that no catastrophe arrives during the sojourn time of an arbitrary Web page, $v^{(1)}(u)$ is the probability that no catastrophe arrives during the sojourn time of an arbitrary Web page given that this Web page will be served successfully, and $v^{(2)}(u)$ is the probability that no catastrophe arrives during the sojourn time of an arbitrary Web page given that this Web page will be deleted from the system due to its obsolescence.

So, formula (16) evidently stems from the formula of total probability. 

It is obvious that sojourn time of the arbitrary (tagged) Web page does 
not depend on the arrival process after the tagged Web page arrival epoch. 
Thus, in analysis we can ignore transitions of the directing process of the 
arrival process after the epoch of the tagged Web page arrival. 
Formulae (17) and (18) also follow from the formula of total probability. Here the row vector $p_{i,k}$ defines the probability that an arbitrary Web page arrives in a batch of size $k$ at a moment when there are $i$ Web pages in the system, together with the probability distribution of the directing processes of service and obsolescence at this moment. The column vector $v_{i,k}^{(m)}(u)$ defines the probability that no catastrophe arrives during the conditional sojourn time of a tagged Web page that arrives in a batch of size $k$ at a moment when there are $i$ Web pages in the system, under fixed values of the directing processes of service and obsolescence at the arrival epoch. For $m = 1$ the condition is that the tagged Web page will be served successfully. For $m = 2$ the condition is that the tagged Web page will be deleted from the system due to its obsolescence.

Relation (19) is based again on the formula of total probability. The matrix $C_l^{(i)}$ defines the probability distribution of the initial states, installed at the arrival epoch of a tagged Web page, of the directing process of service (if $i = 0$, i.e., the system was empty at the arrival epoch) and of the directing processes of obsolescence of the Web pages which arrive in the same batch as the tagged Web page and are placed in the buffer before the tagged Web page, and of this Web page itself. Recall that we assume that if the batch consists of $k$ Web pages then the tagged Web page will be the $l$th in the batch, $l = 1, \dots, k$, with probability $1/k$, $k \ge 1$.

The vector $v_i^{(m)}(u)$ defines the probability that no catastrophe arrives during the conditional sojourn time of a tagged Web page that sees $i$ Web pages in the system ahead of it, given the corresponding states of the directing processes of service and obsolescence after the moment of arrival.

The recurrent formulae (20) for the vector Laplace-Stieltjes transforms $v_i^{(1)}(u)$ are clear if we take into account that: (i) $v_0^{(1)}(u)$ is the vector Laplace-Stieltjes transform of the service time distribution; (ii) $(uI - A_i)^{-1}$ defines the probability that no catastrophe arrives during the time interval between the epoch of a tagged Web page arrival, at which the number of Web pages ahead of the tagged Web page is equal to $i$, and the epoch when this number decreases to $i - 1$; (iii) the matrix $\tilde B_{i-1}$ defines the transition of the directing processes of service and obsolescence of the Web pages at the epoch of the decrease.

Formulae (21) for the vector Laplace-Stieltjes transforms $v_i^{(2)}(u)$ take into account reasonings (ii), (iii) presented above as well as the consideration that no catastrophe should arrive until the obsolescence moment of a tagged Web page and that $l$, $l = 0, \dots, i-1$, Web pages can depart from the system before the obsolescence moment. The theorem is proven.

Corollary 1. The average sojourn time (response time) $V$ of an arbitrary
Web page, the average sojourn time $V^{(1)}$ of an arbitrary Web page that
receives successful service, and the average sojourn time $V^{(2)}$ of an
arbitrary Web page that is deleted from the system due to its obsolescence
are computed by

$$V = V^{(1)} P_{success} + V^{(2)} P_{obs}, \qquad (22)$$



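[Equations (23) and (24), which express the two conditional average sojourn
times as double sums over $i = 0, \ldots, K-1$ and $k = 1, 2, \ldots$, are
illegible in the source.]

where the column vectors
$w_{i,k}^{(m)} = -\left.\frac{d v_{i,k}^{(m)}(u)}{du}\right|_{u=0}$,
$i = 0, \ldots, K-1$, $k \ge 1$, $m = 1, 2$, are computed by formula (25), a sum
over $l = 1, \ldots, \min\{k, K-i\}$ whose summand is illegible in the source,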

the vectors $w_i^{(m)}$, $i = 0, \ldots, K-1$, $m = 1, 2$, are computed by

$$w_0^{(1)} = -S^{-1} e_M, \qquad
w_i^{(1)} = -(A_i)^{-1}\left[ v_i^{(1)}(0) + B_{i-1} w_{i-1}^{(1)} \right], \quad i = 1, \ldots, K-1,$$

$$v_0^{(1)}(0) = e_M, \qquad
v_i^{(1)}(0) = -(A_i)^{-1} B_{i-1} v_{i-1}^{(1)}(0), \quad i = 1, \ldots, K-1.$$

[The explicit expression for $w_i^{(2)}$, $i = 1, \ldots, K-1$, is illegible in
the source.]

Proof of corollary evidently follows from the well-known expression for the 
mean value of a random variable via the derivative of the Laplace-Stieltjes 
transform of its distribution function. 

Note that expressions for the higher-order moments and the variance of the
sojourn time distribution can also be easily derived from equations (16)-(18).

6 Case of the Ordinary Arrival Process



Consider the special case when Web pages arrive not in batches but one by
one. This means that $D_k^{(r)} = O$, $k \ge 2$, $r = 1, \ldots, N$. In this case
the generator $Q$ of the Markov chain $\xi_t$, $t \ge 0$, is a block tridiagonal
matrix, which in turn means that this chain is a finite-state
quasi-birth-and-death process. Thus, in this case the algorithm for solving
the equilibrium equations (4) for the vector $p$ of stationary probabilities
and the formulae for some performance measures simplify. The algorithm has
the form

1. Compute the matrices $G_i$, $i = 0, \ldots, K-1$, from the backward recursion

$$G_i = \left[ -(Q_{i+1,i+1} + Q_{i+1,i+2} G_{i+1}) \right]^{-1} Q_{i+1,i}, \quad i = K-2, K-3, \ldots, 0,$$

with the terminal condition

$$G_{K-1} = -(Q_{K,K})^{-1} Q_{K,K-1}.$$

2. Compute the matrices $F_i$, $i = 0, \ldots, K$, by the recursion

$$F_0 = I, \qquad F_i = F_{i-1} Q_{i-1,i} \left[ -(Q_{i,i} + Q_{i,i+1} G_i) \right]^{-1}, \quad i = 1, \ldots, K.$$

3. Compute the vector $p_0$ as the unique solution to the following system
of linear algebraic equations:

$$p_0 (Q_{0,0} + Q_{0,1} G_0) = 0, \qquad p_0 \sum_{l=0}^{K} F_l e = 1.$$

4. Compute the vectors $p_i$, $i = 1, \ldots, K$, by the formula

$$p_i = p_0 F_i, \quad i = 1, \ldots, K.$$
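
The four steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `qbd_stationary` and the block layout (`diag[i]` for $Q_{i,i}$, `up[i]` for $Q_{i,i+1}$, `down[i]` for $Q_{i+1,i}$) are hypothetical, and for $i = K$ the term $Q_{K,K+1} G_K$ is taken to be absent.

```python
import numpy as np

def qbd_stationary(diag, up, down):
    """Stationary vectors p_0..p_K of a finite QBD with a block tridiagonal generator.

    diag[i] = Q_{i,i}   (i = 0..K)
    up[i]   = Q_{i,i+1} (i = 0..K-1)
    down[i] = Q_{i+1,i} (i = 0..K-1)
    """
    K = len(diag) - 1
    # Step 1: backward recursion for G_i, starting from the terminal condition.
    G = [None] * K
    G[K - 1] = -np.linalg.solve(diag[K], down[K - 1])
    for i in range(K - 2, -1, -1):
        G[i] = np.linalg.solve(-(diag[i + 1] + up[i + 1] @ G[i + 1]), down[i])
    # Step 2: forward recursion for F_i (no G term at the last level K).
    F = [np.eye(diag[0].shape[0])]
    for i in range(1, K + 1):
        local = diag[i] + (up[i] @ G[i] if i < K else 0.0)
        F.append(F[i - 1] @ up[i - 1] @ np.linalg.inv(-local))
    # Step 3: p_0 (Q_00 + Q_01 G_0) = 0 together with the normalization.
    A = diag[0] + up[0] @ G[0]
    b = sum(Fl.sum(axis=1) for Fl in F)          # sum_l F_l e
    system = np.hstack([A, b[:, None]])          # p_0 [A | b] = [0 | 1]
    rhs = np.zeros(A.shape[1] + 1)
    rhs[-1] = 1.0
    p0 = np.linalg.lstsq(system.T, rhs, rcond=None)[0]
    # Step 4: p_i = p_0 F_i.
    return [p0 @ Fl for Fl in F]
```

For scalar blocks the recursion reduces to the familiar birth-and-death solution, which gives a quick sanity check of any implementation.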

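The formula for the loss probability $P_{loss}$ is given by

$$P_{loss} = \lambda^{-1} p_K \left( D_1^{(1)} \otimes I_{M R^{K-1}} \right) e,$$

where

$$\lambda = \sum_{n=0}^{N-1} \sum_{i=j_n+1}^{j_{n+1}} p_i \left( D_1^{(N-n)} \otimes I_{M^{a_i} R^{b_i}} \right) e.$$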



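The Laplace-Stieltjes transforms $v^{(1)}(u)$ and $v^{(2)}(u)$ and the
corresponding average sojourn times are computed by sums over
$i = 0, \ldots, K-1$ of the vectors $p_i$ and $v_i^{(m)}(u)$, normalized by
$\lambda P_{success}$ and $\lambda P_{obs}$, respectively. [The explicit
formulae are illegible in the source.]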



7 Numerical example 

To demonstrate feasibility of the developed algorithms for calculating the 
stationary state distribution of the system under the fixed parameters of 
the control strategy and calculating the optimal set of these parameters, 
let us consider numerical examples. First we suppose that the system can 
have any number of active robots between one and four at any time moment 
(N = 4) and the buffer capacity is equal to 5 (K = 5). 

We assume that the arrival process is formed according to Model 4 from
Section 2. Namely, when $r$ robots are activated, the BMAP input is
described by the matrices $D_k^{(r)}$, $k \ge 0$, $r = 1, \ldots, N$, given by
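$$D_0^{(1)} = \begin{pmatrix} -10 & 2 \\ 0 & -0.5 \end{pmatrix}, \quad
D_1^{(1)} = \begin{pmatrix} 0.05 & 3 \\ 0.29 & 0.005 \end{pmatrix}, \quad
D_2^{(1)} = \begin{pmatrix} 0.05 & 4.9 \\ 0.2 & 0.005 \end{pmatrix};$$

$$D_0^{(2)} = \begin{pmatrix} -5.65 & 1.1 \\ 0 & -0.85 \end{pmatrix}, \quad
D_1^{(2)} = \begin{pmatrix} 0.01 & 2.5 \\ 0.25 & 0 \end{pmatrix}, \quad
D_2^{(2)} = \begin{pmatrix} 0.02 & 0.5 \\ 0.25 & 0 \end{pmatrix}, \quad
D_3^{(2)} = \begin{pmatrix} 0.02 & 1.5 \\ 0.1 & 0.25 \end{pmatrix};$$

$$D_0^{(3)} = \begin{pmatrix} -2.48 & 0.48 \\ 0.48 & -3.48 \end{pmatrix}, \quad
D_1^{(3)} = \begin{pmatrix} 1.5 & 0 \\ 0 & 2.25 \end{pmatrix}, \quad
D_2^{(3)} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.75 \end{pmatrix};$$

$$D_0^{(4)} = \begin{pmatrix} -1.45 & 0.45 \\ 0.6 & -2.6 \end{pmatrix}, \quad
D_1^{(4)} = \begin{pmatrix} 0.25 & 0 \\ 0 & 0.5 \end{pmatrix}, \quad
D_2^{(4)} = D_3^{(4)} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad
D_4^{(4)} = \begin{pmatrix} 0.75 & 0 \\ 0 & 1.5 \end{pmatrix}.$$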

The intensities $\lambda^{(r)}$ of the BMAP when $r$ robots are active,
$r = 1, \ldots, N$, are computed by

$$\lambda^{(1)} = 1.28, \quad \lambda^{(2)} = 2.41, \quad \lambda^{(3)} = 3.125, \quad \lambda^{(4)} = 4.64.$$

The coefficients of correlation $c_{cor}^{(r)}$ and the intensities of batch
arrivals $\lambda_g^{(r)}$ are the following: $c_{cor}^{(1)} = -0.218$,
$c_{cor}^{(2)} = -0.111$, $c_{cor}^{(3)} = 0.02$, $c_{cor}^{(4)} = 0.035$,
$\lambda_g^{(1)} = 0.853$, $\lambda_g^{(2)} = 1.208$, $\lambda_g^{(3)} = 2.5$,
$\lambda_g^{(4)} = 1.43$.
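
As a rough check of these figures, the sketch below computes the fundamental rate of a BMAP and the rate of batch arrivals from its matrices. The function name `bmap_rates` is hypothetical. The matrix entries in the example are those of the first mode as read from the example above, with the entries that the extraction dropped taken as the values forced by the zero row sums of a generator (an assumption).

```python
import numpy as np

def bmap_rates(D):
    """Return (lambda, lambda_g) for a BMAP given as the list [D_0, D_1, ..., D_m].

    lambda   = theta * sum_k k D_k * e   (mean number of arrivals per unit time)
    lambda_g = theta * (-D_0) * e        (mean number of batches per unit time)
    where theta is the stationary vector of the generator D_0 + D_1 + ... + D_m.
    """
    D = [np.asarray(Dk, dtype=float) for Dk in D]
    n = D[0].shape[0]
    generator = sum(D)
    # Solve theta * generator = 0 together with theta * e = 1.
    system = np.vstack([generator.T, np.ones(n)])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    theta = np.linalg.lstsq(system, rhs, rcond=None)[0]
    e = np.ones(n)
    lam = theta @ sum(k * Dk for k, Dk in enumerate(D)) @ e
    lam_g = theta @ (-D[0]) @ e
    return lam, lam_g

# Mode with one active robot (entries assumed as described in the lead-in).
D_mode1 = [
    [[-10.0, 2.0], [0.0, -0.5]],
    [[0.05, 3.0], [0.29, 0.005]],
    [[0.05, 4.9], [0.2, 0.005]],
]
lam1, lam_g1 = bmap_rates(D_mode1)
```

Under these assumptions the computed values agree with the reported 1.28 and 0.853 to the printed precision.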

Let the PH distribution of service time be defined by the row vector
$\beta = (0.4, 0.6)$ and the sub-generator $S$ [entries illegible in the
source]. The mean service time is equal to 0.657.

Let the PH distribution of a Web page obsolescence time be described by the
row vector $\gamma = (0.3, 0.7)$ and the sub-generator $\Gamma$ [entries
illegible in the source]. The mean time until obsolescence is equal to 5.

Let the cost coefficients be fixed by $c_{loss} = 5$, $c_{obs} = 10$, $a = 2$,
$c_{rob} = 20$, $c_{star} = 300$. We have chosen the cost coefficients in this
way to obtain commensurable values in the optimal solution and a non-trivial
optimal policy.

Note that in this example we fixed the cost coefficients based on heuristic
reasoning and common sense. In general, the right choice of the cost
coefficients is an important and difficult task. It requires good knowledge
of the real-world system described by the mathematical model under study and
a clear understanding of what is most undesirable for the concrete system
(loss of delivered Web pages, obsolescence of a page, starvation of the
system, long waiting in the queue, keeping too many robots active, etc.) and
what is less important. So, the help of experts is required for the right
choice of the cost coefficients. If such a choice does not seem possible,
an alternative formulation of the optimization problem, e.g., a
multi-criteria problem or a problem with constraints, may be considered. In
the latter approach the cost coefficients are Lagrange multipliers in the
constrained problem and can be found from the dual problem formulation.

Let us find the optimal control strategy for the system under the values of
the system parameters and the cost coefficients fixed above. The thresholds
$j_1, \ldots, j_{N-1}$ in the problem formulation are fixed such that
$-1 = j_0 < j_1 < \cdots < j_{N-1} < j_N = K$, i.e., the thresholds cannot
coincide and the use of all $N$ modes of operation (the mode is characterized
by the number of active robots) is mandatory. It is intuitively clear that
the optimal strategy may not need to use some modes of operation at all. So,
to find the optimal strategy, we have to compare the optimal values of the
cost criterion when all $N$ modes are used, when $N - 1$ modes are used while
1 mode is ignored, ..., when two modes are used while $N - 2$ modes are
ignored, and when only one mode is used (i.e., the number of active robots is
not varied).
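
The comparison described above amounts to an exhaustive search over subsets of modes and strictly increasing thresholds. The following sketch enumerates the candidates; all names are hypothetical, and `cost` stands for a user-supplied function that evaluates the cost criterion for a given strategy (it is not defined here, since it requires the stationary distribution of the model).

```python
from itertools import combinations

def threshold_strategies(N, K):
    """Yield (modes, thresholds) for every subset of modes and every
    strictly increasing threshold vector with -1 < j_1 < ... < j_{r-1} < K.

    modes      -- robot counts in decreasing order, e.g. (3, 1)
    thresholds -- one threshold per switch between consecutive modes
    """
    for r in range(1, N + 1):
        for modes in combinations(range(N, 0, -1), r):
            for thresholds in combinations(range(K), r - 1):
                yield modes, thresholds

def active_robots(modes, thresholds, queue_length):
    """Number of active robots under a threshold strategy: the first mode
    whose threshold is not exceeded, or the last mode otherwise."""
    for mode, j in zip(modes, thresholds):
        if queue_length <= j:
            return mode
    return modes[-1]

def best_strategy(N, K, cost):
    """Return the (modes, thresholds) pair minimizing cost(modes, thresholds)."""
    return min(threshold_strategies(N, K), key=lambda s: cost(*s))
```

For $N = 4$ and $K = 5$ this enumeration contains 84 candidate strategies, so a brute-force search is inexpensive as long as the cost criterion itself can be evaluated quickly.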

Let us denote by $C_r$ the value of the cost criterion when exactly $r$ robots
are always active, $r = 1, \ldots, 4$. The values $C_r$ are given by
$C_1 = 149.91$, $C_2 = 110.0$, $C_3 = 89.405$, $C_4 = 130.312$. So, if there is
no possibility to control the number of robots, and one has to decide how
many robots should work permanently, the best choice is to have three
permanently active robots.

Next we consider threshold-type strategies for controlling the number of
active robots. Table 1 contains the optimal value of the cost criterion for
various combinations of the used modes and the optimal threshold strategy
for each such combination.

Table 1: The Value of the Optimal Cost Criterion for various threshold
strategies

Possible numbers       Optimal        Optimal value of
of active robots       thresholds     the cost criterion

1                                     149.91
2                                     110.0
3                                     89.40
4                                     130.31
2 or 1                 2              103.54
3 or 1                 2              63.54
4 or 1                 1              74.47
3 or 2                 2              76.21
4 or 2                 1              86.13
4 or 3                                94.14
3 or 2 or 1            2,2            63.54
4 or 2 or 1            1,2            73.69
4 or 3 or 1            0,2            80.50
4 or 3 or 2            0,2            80.50
4 or 3 or 2 or 1       0,2,2          67.52

Let us explain the entries of the table. For instance, the highlighted
line (3 or 1) corresponds to the control with one threshold at two. When the
queue length does not exceed two, there are three active robots, and when
the queue length exceeds two, the number of active robots decreases to one.
As another example, the control corresponding to the last line of the table
has the following structure: when the system is empty, four robots are
active; when the queue length is greater than zero and smaller than three,
three robots are active; when the queue length exceeds two, only one robot
is active.

As seen from Table 1, the optimal strategy uses only two of the four
available operation modes (the modes with one and three active robots). The
optimal value $C^*$ of the cost criterion is equal to 63.54. Thus $C^*$ gives
a relative profit $R$ of more than 28% compared to the case without control.
The value of $R$ is computed as

$$R = \frac{\min_r C_r - C^*}{\min_r C_r} \cdot 100\%.$$

The dependence of the cost criterion on the threshold when the control
strategy uses only two modes, for all possible combinations of the modes,
is shown in Figure 1.
Figure 1: Dependence of the cost criterion on the threshold

The obtained results illustrate the necessity of input control and the
possibility of reducing the cost of the system operation by means of
threshold-type control.
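
The relative profit quoted above can be reproduced with a one-line computation. The sketch assumes that $R$ is measured against the best strategy with a fixed number of robots, i.e. $R = (\min_r C_r - C^*) / \min_r C_r \cdot 100\%$; this reading is an assumption, chosen because it reproduces the percentages reported in the tables of this section. The function name is hypothetical.

```python
def relative_profit(fixed_costs, controlled_cost):
    """Relative profit (in percent) of the controlled strategy over the
    best strategy with a fixed number of active robots."""
    best_fixed = min(fixed_costs)
    return 100.0 * (best_fixed - controlled_cost) / best_fixed

# Values of C_1..C_4 and C* from the first numerical example (K = 5).
profit = relative_profit([149.91, 110.0, 89.405, 130.312], 63.54)
```

With these inputs the result is slightly below 29%, consistent with the statement that the profit exceeds 28%.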

Now let us vary the buffer size $K$ from 1 to 100. Table 2 contains the
optimal criterion value for the cases with and without control ($C^*$ and
$C_r$, $r = 1, \ldots, 4$), the optimal threshold $j^*$ and the relative
profit $R$ for different values of the buffer capacity $K$. Note that in all
cases the optimal control uses only the modes with one and three active
robots. The results for $K > 30$ are approximately the same as for $K = 30$.
Table 2: Variable Buffer Size K

K     j*    C*       C_1      C_2      C_3      C_4      R, %

1           147.5    244.7    233.4    187.2    258.8    21.0
2     1     96.8     199.2    174.0    128.8    194.4    24.8
3     2     79.1     172.6    140.3    105.4    160.0    24.9
4     2     68.3     158.1    121.7    94.7     140.6    27.85
5     2     63.5     149.9    110.0    89.4     130.3    28.9
6     3     60.8     144.7    102.3    86.7     124.1    29.8
7     3     59.3     141.6    97.2     85.5     120.5    30.6
8     3     58.4     138.5    93.7     85.0     118.3    31.2
9     3     57.8     137.9    91.3     84.9     117.1    31.8
10    3     57.5     137.0    89.7     85.0     116.5    32.3
20    3     57.2     137.0    86.3     87.2     120.0    33.7
30    3     57.2     137.0    86.3     87.4     123.6    33.7



Let the mean service time $b_1$ be varied by multiplying the matrix $S$ by a
value $s$ which varies from 0.1 to 15. Table 3 contains the value $s$, the
mean service time $b_1$, the optimal set of possible numbers of active
robots, the optimal value of the threshold, the optimal cost criterion
value, the value of the cost criterion when only one of the modes is in use
(when the number of active robots is fixed), and the relative profit $R$.
Note that when $s$ is equal to or greater than 15, i.e., $b_1$ is less than
0.04, no dynamic control is required.

Now we vary the mean time $g_1$ until obsolescence in the same way as was
done for the mean service time. Table 4 shows the obtained results. Note
that in the optimal set of modes only one and three active robots are used
in all considered cases.

Consider the effect of the service time variation on the cost criterion
value given that the mean service time $b_1$ is constant and equal to 0.657.
Let the matrix $S$ of the service time distribution have the form

$$S = \begin{pmatrix} -\alpha_1 & 0 \\ 0 & -\alpha_2 \end{pmatrix}$$

and the vector $\beta = (0.9, 0.1)$.

The variance of the service time is calculated as

$$var_s = 2 \beta S^{-2} e - (\beta S^{-1} e)^2.$$

To maintain the mean service time $b_1$, the entries $\alpha_1$ and $\alpha_2$
of the matrix $S$ must be related through the formula $b_1 = \beta (-S)^{-1} e$
as

$$\alpha_2 = \frac{0.1\,\alpha_1}{b_1 \alpha_1 - 0.9}.$$

Note that $\alpha_1$ should be greater than 1.369 to keep $\alpha_2$ positive.
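
To make the relation between the two rates concrete, the sketch below computes $\alpha_2$ for a given $\alpha_1$ and the resulting variance. It assumes a two-phase hyperexponential distribution with initial probabilities 0.9 and 0.1 and rates $\alpha_1$, $\alpha_2$ (the diagonal form of the sub-generator is an assumption consistent with the relation above). The function names are hypothetical.

```python
def second_rate(alpha1, mean, p1=0.9):
    """Rate alpha2 that keeps the mean p1/alpha1 + (1 - p1)/alpha2 equal to `mean`."""
    p2 = 1.0 - p1
    denom = mean * alpha1 - p1
    if denom <= 0:
        raise ValueError("alpha1 is too small: alpha2 would not be positive")
    return p2 * alpha1 / denom

def hyperexp_variance(alpha1, alpha2, p1=0.9):
    """Variance of a two-phase hyperexponential distribution."""
    p2 = 1.0 - p1
    mean = p1 / alpha1 + p2 / alpha2
    second_moment = 2.0 * (p1 / alpha1 ** 2 + p2 / alpha2 ** 2)
    return second_moment - mean ** 2

# Service time example: mean 0.657, alpha1 = 9.
a2 = second_rate(9.0, 0.657)
var_service = hyperexp_variance(9.0, a2)
```

Under these assumptions the variance for $\alpha_1 = 9$ comes out close to 5.8, and for $\alpha_1 = 1.521$ close to 0.43.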

Let us vary the value of $\alpha_1$ from 1.521 to 9. The service time
variation $var_s$ takes values from 0.431 to 5.798. In the case
$\alpha_1 = 1.521$ the service time distribution is exponential. In the
optimal set of operation modes only one and three active robots are used.
Figure 2 shows the dependence of the cost criterion value on the threshold
under different values of the service time variation
($\alpha_1 \in \{1.521, 3, 5, 7, 9\}$).

Figure 2: Dependence of the cost criterion on service time variation

But when $\alpha_1$ becomes larger than 4 ($var_s > 3.415$), the optimal set
of exploited modes consists not of the modes with one and three active
robots but of the modes with one and four active robots. Figure 3 shows the
dependence of the cost criterion when only the optimal set of modes is in
use. In the figure, the two lower curves correspond to the modes with one
and three active robots and the other curves correspond to the modes with
one and four active robots.

Figure 3: Dependence of the cost criterion on service time variation

Now let us vary the variance of the time until obsolescence in the same way.
Let the matrix $\Gamma$ have the form

$$\Gamma = \begin{pmatrix} -\alpha_1 & 0 \\ 0 & -\alpha_2 \end{pmatrix}$$

and $\gamma = (0.9, 0.1)$. To maintain the average time until obsolescence
$g_1 = 5$, the value $\alpha_2$ is related to $\alpha_1$ as

$$\alpha_2 = \frac{0.1\,\alpha_1}{g_1 \alpha_1 - 0.9}.$$

We vary $\alpha_1$ from 0.2 to 7. Thus the variation takes values from 25 to
449. Note that in the case $\alpha_1 = 0.2$ the time until obsolescence is
distributed exponentially. The optimal strategy consists of the modes with
one and three active robots in all considered cases. Figure 4 shows the
dependence of the cost criterion value on the variation of the time until
obsolescence ($\alpha_1 \in \{0.2, 0.5, 1, 3, 7\}$). Note that as the
variation grows the optimal threshold decreases from 2 to 0.

Figure 4: Dependence of the cost criterion on variation of time until
obsolescence

Now let us consider the example based on real data obtained from the
robot designed by the INRIA Maestro team in the framework of the RIAM
INRIA-Canon Research Project. Data about the information delivery process by
the robot to the database was presented in the form of a text file which
contained more than 65 000 timestamps defining the epochs of information
delivery.

This data was processed in the following way.

• Inter-arrival times were computed.

• The obtained sample was censored: very long intervals, which actually
correspond to the periods when the crawling process was stopped for some
reason, were deleted from the sample.

• The obtained sample was transformed into two samples: the very short
inter-arrival times were deleted from the initial sample, and the number of
successively deleted intervals was placed into another sample. As a result
of these manipulations, we treat the arrival process as a batch arrival
process. One sample defines the intervals between the epochs of batch
arrivals; the second sample defines the number of information units in the
corresponding batch. By information units we mean either the principal part
of a Web page (e.g., the main html file of the Web page) or its embedded
resources (e.g., image and audio files).

• Based on the second sample, the distribution of the number of information
units in a batch was computed. The size of a batch varies in the interval
[1,8], and the probabilities $d_k$ that the batch size is equal to
$k = 1, 2, \ldots, 8$ are the following: $d_1 = 0.040445227$,
$d_2 = 0.18404232$, $d_3 = 0.26405114$, $d_4 = 0.5057307$,
$d_5 = 8.8163983 \cdot 10^{-4}$, $d_6 = 7.7143486 \cdot 10^{-4}$,
$d_7 = 0.0018734847$, $d_8 = 0.00220405$. The mean batch size is equal to
3.263389905223716.

• Based on the first sample, we estimated the mean value of an interval
between batch arrivals as 212.03 and the variance of such an interval as
51352.38; the estimates of the lag-$k$ correlation of inter-arrival times
for $k$ equal to 1, 2, ..., 6 are 0.622, 0.574, 0.555, 0.537, 0.523, 0.507.
Thus, the flow defined by the sample under study has slowly decreasing
correlation. In this situation, it is reasonable to apply the method by
Diamond and Alfa [1], oriented to such flows.

As a result, the arrival process of batches was defined by the MAP which is
characterized by the matrices

$$D_0 = \begin{pmatrix} -0.0038 & 0 \\ 0 & -0.0066 \end{pmatrix}, \qquad
D_1 = A, \qquad
A = \begin{pmatrix} 0.0037 & 6.58 \cdot 10^{-5} \\ 1.3 \cdot 10^{-4} & 0.0064 \end{pmatrix}.$$

• Based on this MAP of batches and the information about the batch size
distribution, the BMAP of Web pages is constructed. It is defined by the
matrices $D_0$ and $D_k = d_k A$, $k = 1, \ldots, 8$.
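
The preprocessing steps listed above can be sketched as follows. This is an illustration only: the function names and the two cut-off parameters `batch_gap` (gaps at or below it are treated as "very short") and `pause_gap` (gaps above it are censored as crawler pauses) are hypothetical, since the text does not state the thresholds that were actually used. A batch is assumed to contain one unit more than the number of successive short gaps.

```python
from collections import Counter

def split_into_batches(timestamps, batch_gap, pause_gap):
    """Turn sorted delivery epochs into (intervals between batches, batch sizes)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    intervals, sizes = [], []
    size = 1
    for gap in gaps:
        if gap <= batch_gap:
            size += 1              # very short gap: same batch
            continue
        sizes.append(size)         # a longer gap closes the current batch
        size = 1
        if gap <= pause_gap:       # gaps above pause_gap are censored
            intervals.append(gap)
    sizes.append(size)
    return intervals, sizes

def batch_size_distribution(sizes):
    """Empirical probabilities d_k of the batch sizes."""
    counts = Counter(sizes)
    return {k: counts[k] / len(sizes) for k in sorted(counts)}

# Mean batch size recomputed from the probabilities d_k reported in the text.
reported_d = [0.040445227, 0.18404232, 0.26405114, 0.5057307,
              8.8163983e-4, 7.7143486e-4, 0.0018734847, 0.00220405]
mean_batch_size = sum(k * d for k, d in enumerate(reported_d, start=1))
```

The last line is a consistency check on the reported distribution: its mean agrees with the stated mean batch size of about 3.2634.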

To estimate the service time distribution of a Web page, the following
information about the size of an arbitrary information unit delivered by a
crawler (content size) in the database used was exploited: the mean content
size is equal to 49207.0356 bytes, the mean squared content size is equal to
2.0527E+11, and the mean cubed content size is equal to 2.5773E+18. Based on
this information, the service time distribution of a Web page was described,
up to a normalizing constant defined by the content processing rate (in our
application the processing rate is constant), by the hyper-exponential
distribution, which is a special case of the PH distribution, defined by the
vector $\beta = (0.0057, 0.9943)$ and by the sub-generator

$$S = \begin{pmatrix} -0.0014 & 0 \\ 0 & -0.2409 \end{pmatrix}.$$

The mean service time is 8.2. The squared coefficient of variation of the
service time distribution is equal to 86.03. Because the exponential
distribution has a squared coefficient of variation equal to 1, it is clear
that the exponential distribution cannot be considered a good approximation
of the service time distribution.
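
The quoted mean and squared coefficient of variation follow from the standard moment formulae of a PH distribution, $E[X] = \beta(-S)^{-1}e$ and $E[X^2] = 2\beta(-S)^{-2}e$. The sketch below evaluates them; the function name is hypothetical, and the sub-generator is taken to be diagonal (hyper-exponential), which is an assumption about the entries lost in extraction.

```python
import numpy as np

def ph_moments(beta, S):
    """Return (mean, variance, squared coefficient of variation) of PH(beta, S)."""
    beta = np.asarray(beta, dtype=float)
    S = np.asarray(S, dtype=float)
    e = np.ones(S.shape[0])
    inv = np.linalg.inv(-S)                     # (-S)^{-1}
    mean = beta @ inv @ e
    second_moment = 2.0 * beta @ inv @ inv @ e
    variance = second_moment - mean ** 2
    return mean, variance, variance / mean ** 2

# Hyper-exponential service time fitted to the crawler data.
mean_s, var_s, scv_s = ph_moments([0.0057, 0.9943],
                                  [[-0.0014, 0.0], [0.0, -0.2409]])
```

With these inputs the mean is close to 8.2 and the squared coefficient of variation close to 86, in line with the values reported above.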

We assume that the system has four available operation modes. The
buffer capacity is $K = 20$.

When $r$ robots are activated, the BMAP input is described by the matrices

$$D_k^{(r)} = r D_k, \quad k = 0, \ldots, 8, \quad r = 1, \ldots, 4,$$

where the matrices $D_k$, $k = 0, \ldots, 8$, are defined above. The
intensities $\lambda^{(r)}$ of the BMAP when $r$ robots are used,
$r = 1, \ldots, 4$, are as follows: $\lambda^{(1)} = 0.0153$,
$\lambda^{(2)} = 0.0307$, $\lambda^{(3)} = 0.046$, $\lambda^{(4)} = 0.061$.

The obsolescence time of a Web page is exponentially distributed with
parameter 0.0005, i.e., the mean time until obsolescence is equal to 2000.

The cost criterion coefficients are taken as $c_{loss} = 200$,
$c_{obs} = 250$, $a = 3$, $c_{rob} = 20$, $c_{star} = 600$.

The cost criterion values $C_r$ when $r$ robots, $r = 1, \ldots, 4$, are
always active are given by $C_1 = 666.28$, $C_2 = 657.07$, $C_3 = 639.03$ and
$C_4 = 621.25$. When all four modes of operation are exploited, the optimal
cost criterion value is 563.51 and the optimal set of thresholds is [2,2,2],
i.e., the optimal strategy assumes that four robots should be active as long
as the number of Web pages in the system does not exceed 2. When the number
of Web pages in the system exceeds 2, three robots should be deactivated and
only one robot should remain active.

Table 5 contains the optimal values of the cost criterion for the different 
combinations of operation modes. 

Table 5: The Values of the Optimal Cost Criterion for the Fixed
Combination of Operation Modes

Possible numbers       Optimal         Optimal value of
of active robots       thresholds      the cost criterion

1                                      666.28
2                                      657.07
3                                      639.03
4                                      621.25
1,2                    0               624.97
1,3                    [illegible]     [illegible]
1,4                    2               563.51
2,3                    1               622.81
2,4                    2               593.29
3,4                    3               609.66
1,2,3                  0,0             591.72
1,2,4                  2,2             563.51
1,3,4                  2,2             563.51
2,3,4                  2,2             593.29
1,2,3,4                2,2,2           563.51



The relative profit of operation mode control exceeds 9% compared to the
case when no control is applied and all four robots are always active.



8 Conclusion 

In this paper we provide performance evaluation and optimization of the 
crawling part of a Web search engine. We model the crawler with a finite 
buffer, monotonically controlled arrival rate (controlled number of crawling 
robots), and with stochastically bounded waiting time. The system is con- 
sidered under rather general assumptions about the arrival process, service 
and obsolescence time distributions. Stationary distribution of the system 
state, sojourn time distribution and main performance measures of the sys- 
tem are calculated under any fixed set of thresholds defining the control 
strategy. This allows us to reduce the problem of optimal control to mini- 
mization of a known function of several integer variables. 

Numerical results are presented. They show that dynamic input control can
give a substantial profit. The effects of changes in the buffer size and in
the average service and obsolescence times and their variances are
investigated. In particular, we illustrate that the assumption of
exponentially distributed service and obsolescence times can give a poor
estimate of the system performance measures and of the optimal values of the
thresholds when these times actually have high variation.

The model has been applied to the performance evaluation and opti- 
mization of the crawler designed by INRIA Maestro team in the framework 
of the RIAM INRIA-Canon research project. 
Table 3: Variable Mean Service Time

s      b_1     numbers of       j*    C*        C_1       C_2       C_3       C_4       R, %
               active robots

0.1    6.57    1,3                    53.1      58.58     80.67     104.74    131.84    9.35
0.2    3.29    1,3              1     45.08     59.45     72.77     94.16     122.04    24.17
0.3    2.19    1,3              1     42.93     68.51     72.02     89.42     118.55    37.34
0.4    1.64    1,3              1     43.33     80.36     74.27     86.7      117.45    41.66
0.5    1.31    1,3              1     45.28     93.09     78.34     85.15     117.77    42.2
0.7    0.94    1,3              2     51.17     117.98    89.66     84.65     121.17    39.55
0.9    0.73    1,3              2     58.88     140.09    103.05    87.14     126.89    32.43
1      0.66    1,3              2     63.54     149.91    110       89.4      130.31    28.93
3      0.22    1,3              3     160.48    244.04    211.44    182.29    204.54    11.96
5      0.13    1,4              1     217.02    272.19    254.87    243.01    251.25    10.7
7      0.09    1,4              1     250.83    285.26    276.93    274.4     279.31    8.59
9      0.07    1,4                    272.76    292.74    290.05    292.8     297.61    5.96
11     0.06    1,4                    288.23    297.59    298.7     304.77    310.39    3.15
13     0.05    1,4                    299.92    300.97    304.82    313.15    319.78    0.35
15     0.04    1                      309.061   303.47    309.37    319.33    326.96
Table 4: Variable Mean Time until Obsolescence

s       g_1      j*    C*        C_1       C_2       C_3       C_4       R, %

0.01    500      3     52.07     130.02    92.47     81.28     118.93    37.16
0.1     50       3     52.41     132.22    94.24     81.99     119.99    36.08
0.2     25       2     53.83     134.55    96.17     82.79     121.17    34.98
0.3     16.7     2     55.1      136.78    98.05     83.6      122.34    34.09
0.4     12.5     2     56.36     138.9     99.88     84.41     123.51    33.23
0.5     10       2     57.6      140.93    101.67    85.24     124.67    32.43
0.7     7.1      2     60.02     144.74    105.12    86.9      126.95    30.93
0.9     5.6      2     62.38     148.25    108.42    88.57     129.21    29.57
1       5        2     63.54     149.91    110       89.4      130.31    28.93
3       1.67     1     83.6      173.68    135.36    105       149.91    20.38
5       1        1     95.64     187.76    152.48    117.52    165.21    18.62
10      0.5      1     116.47    206.98    178.08    138.25    191.33    15.75
20      0.25           131.1     223.17    201.46    158.66    218.9     17.37
30      0.17           136.67    230.45    212.45    168.67    233.25    18.97
40      0.13           139.97    234.54    218.83    174.63    242.04    19.85
50      0.1            142.15    237.18    222.99    178.58    247.97    20.4
60      0.083          143.72    239.03    225.91    181.39    252.25    20.77
70      0.071          144.9     240.38    228.08    183.5     255.47    21.04
80      0.063          145.81    241.42    229.75    185.13    257.99    21.24
90      0.056          146.55    242.24    231.08    186.44    260.01    21.4
100     0.05           147.16    242.91    232.15    187.51    261.67    21.52
200     0.025          150.08    245.97    237.19    192.58    269.62    22.07






